2025-05-23-12-07
Logic-of-Thought: Empowering Large Language Models with Logic Programs for Solving Puzzles in Natural Language
Abstract
arXiv:2505.16114v1. Solving puzzles in natural language poses a long-standing challenge in AI. While large language models (LLMs) have recently shown impressive capabilities in a variety of tasks, they continue to struggle with complex puzzles that demand precise reasoning and exhaustive search. In this paper, we propose Logic-of-Thought (Logot), a novel framework that bridges LLMs with logic programming to address this problem. Our method leverages LLMs to translate puzzle rules and states into answer set programs (ASPs), whose solutions are then accurately and efficiently inferred by an ASP interpreter. This hybrid approach combines the natural language understanding of LLMs with the precise reasoning capabilities of logic programs. We evaluate our method on various grid puzzles and dynamic puzzles involving actions, demonstrating near-perfect accuracy across all tasks. Our code and data are available at: https://github.com/naiqili/Logic-of-Thought.
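Summary
Solving puzzles posed in natural language is a long-standing challenge in artificial intelligence. Although large language models (LLMs) have recently shown excellent performance across many kinds of tasks, they still struggle with complex puzzles that require precise reasoning and exhaustive search. This paper proposes Logic-of-Thought (Logot), a novel framework that addresses the problem by combining LLMs with logic programming. Our method uses an LLM to translate puzzle rules and states into answer set programs (ASPs), which an ASP interpreter then solves accurately and efficiently. This hybrid approach combines the natural language understanding of LLMs with the precise reasoning of logic programs. We evaluate the method on a variety of grid puzzles and on dynamic puzzles involving actions, and it achieves near-perfect accuracy on all tasks. Code and data are available at: https://github.com/naiqili/Logic-of-Thought.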
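The division of labor in Logic-of-Thought (an LLM writes down the rules, a solver does the exhaustive search) can be illustrated with a minimal sketch. It rests on several assumptions: the translation step is taken as already done, the rules are hand-written Python predicates for a hypothetical 4x4 Latin-square puzzle rather than an answer set program, and a plain backtracking search stands in for the ASP interpreter. None of the names below come from the paper or its repository.

```python
from typing import List, Optional

Grid = List[List[int]]  # 0 marks an empty cell


def consistent(grid: Grid, r: int, c: int, v: int) -> bool:
    """Puzzle rule (assumed): a value appears at most once per row and column."""
    n = len(grid)
    row_ok = all(grid[r][j] != v for j in range(n))
    col_ok = all(grid[i][c] != v for i in range(n))
    return row_ok and col_ok


def solve(grid: Grid) -> Optional[Grid]:
    """Exhaustive backtracking search; returns a completed copy or None."""
    n = len(grid)
    work = [row[:] for row in grid]  # leave the caller's grid untouched

    def fill(pos: int) -> bool:
        if pos == n * n:
            return True  # every cell is assigned
        r, c = divmod(pos, n)
        if work[r][c] != 0:
            return fill(pos + 1)  # given cell, skip it
        for v in range(1, n + 1):
            if consistent(work, r, c, v):
                work[r][c] = v
                if fill(pos + 1):
                    return True
                work[r][c] = 0  # undo and try the next value
        return False

    return work if fill(0) else None


# Hypothetical puzzle instance, standing in for the state an LLM would extract.
puzzle = [
    [1, 0, 0, 4],
    [0, 0, 1, 0],
    [0, 1, 0, 0],
    [4, 0, 0, 1],
]
solution = solve(puzzle)
```

The point of the split is that the search never guesses: once the rules are stated declaratively, correctness depends only on the rules being right, which is the part the paper delegates to the LLM.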
SPhyR: Spatial-Physical Reasoning Benchmark on Material Distribution
Abstract
arXiv:2505.16048v1. We introduce a novel dataset designed to benchmark the physical and spatial reasoning capabilities of Large Language Models (LLMs) based on topology optimization, a method for computing optimal material distributions within a design space under prescribed loads and supports. In this dataset, LLMs are provided with conditions such as a 2D boundary, applied forces, and supports, and must reason about the resulting optimal material distribution. The dataset includes a variety of tasks, ranging from filling in masked regions within partial structures to predicting complete material distributions. Solving these tasks requires understanding the flow of forces and the required material distribution under given constraints, without access to simulation tools or explicit physical models, challenging models to reason about structural stability and spatial organization. Our dataset targets the evaluation of spatial and physical reasoning abilities in 2D settings, offering a complementary perspective to traditional language and logic benchmarks.
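Summary
We present a novel dataset for evaluating the physical and spatial reasoning abilities of large language models (LLMs) on the basis of topology optimization. Given a 2D boundary, applied forces, and support conditions, the model must infer the optimal material distribution. The dataset covers several task types, including filling in masked regions of partial structures and predicting complete material distributions. Solving these tasks requires understanding how forces flow and what material distribution is needed under the given constraints, without relying on simulation tools or explicit physical models, which challenges a model's ability to reason about structural stability and spatial organization. The dataset focuses on evaluating spatial and physical reasoning in 2D settings and offers a complementary perspective to traditional language and logic benchmarks.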
Causal LLM Routing: End-to-End Regret Minimization from Observational Data
Abstract
arXiv:2505.16037v1. LLM routing aims to select the most appropriate model for each query, balancing competing performance metrics such as accuracy and cost across a pool of language models. Prior approaches typically adopt a decoupled strategy, where the metrics are first predicted and the model is then selected based on these estimates. This setup is prone to compounding errors and often relies on full-feedback data, where each query is evaluated by all candidate models, which is costly to obtain and maintain in practice. In contrast, we learn from observational data, which records only the outcome of the model actually deployed. We propose a causal end-to-end framework that learns routing policies by minimizing decision-making regret from observational data. To enable efficient optimization, we introduce two theoretically grounded surrogate objectives: a classification-based upper bound, and a softmax-weighted regret approximation shown to recover the optimal policy at convergence. We further extend our framework to handle heterogeneous cost preferences via an interval-conditioned architecture. Experiments on public benchmarks show that our method outperforms existing baselines, achieving state-of-the-art performance across different embedding models.
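Summary
LLM routing aims to select the most suitable model for each query, balancing competing performance metrics such as accuracy and cost across a pool of language models. Existing methods usually adopt a decoupled strategy: they first predict the metrics and then select a model based on those estimates. This setup is prone to error accumulation and often depends on full-feedback data, in which every query is evaluated by all candidate models and which is expensive to collect and maintain. In contrast, we learn from observational data, which records only the outcome of the model that was actually deployed. We propose a causal end-to-end framework that learns routing policies by minimizing decision regret on observational data. For efficient optimization, we introduce two theoretically grounded surrogate objectives: a classification-based upper bound, and a softmax-weighted regret approximation that is shown to recover the optimal policy at convergence. We further extend the framework with an interval-conditioned architecture to handle heterogeneous cost preferences. Experiments on public benchmarks show that the method outperforms existing baselines and reaches state-of-the-art performance across different embedding models.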
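To make the idea of a softmax-weighted regret concrete, here is a minimal sketch under simplifying assumptions. It treats the utility of every candidate model as known for the query, which is the full-feedback setting; the paper instead learns from observational data, and its exact surrogate objective is not reproduced here. The function names, the utility values, and the temperature parameter are all illustrative.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax over router scores."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def soft_regret(
    scores: Sequence[float],
    utilities: Sequence[float],
    temperature: float = 1.0,
) -> float:
    """Regret of a stochastic router: best utility minus expected utility.

    `utilities[i]` is the (assumed known) payoff of sending the query to
    model i, for example accuracy minus a cost penalty.
    """
    probs = softmax(scores, temperature)
    expected = sum(p * u for p, u in zip(probs, utilities))
    return max(utilities) - expected


# Hypothetical per-model utilities for one query; model 1 is the best choice.
utilities = [0.6, 0.9, 0.7]
uniform_regret = soft_regret([0.0, 0.0, 0.0], utilities)   # router has no preference
peaked_regret = soft_regret([0.0, 50.0, 0.0], utilities)   # router strongly prefers model 1
```

Because the quantity is a smooth function of the scores, it can be minimized with gradient methods, and it shrinks toward zero as the router concentrates its probability on the best model.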
Optimizing LLM-Based Multi-Agent System with Textual Feedback: A Case Study on Software Development
Abstract
arXiv:2505.16086v1. We have seen remarkable progress in multi-agent systems empowered by large language models (LLMs), which solve complex tasks requiring cooperation among experts with diverse skills. However, optimizing LLM-based multi-agent systems remains challenging. In this work, we perform an empirical case study on group optimization of role-based multi-agent systems, using natural language feedback for challenging software development tasks under various evaluation dimensions. We propose a two-step agent prompt optimization pipeline: first identifying underperforming agents and their failure explanations using textual feedback, and then optimizing the system prompts of the identified agents using those failure explanations. We then study the impact of various optimization settings on system performance with two comparison groups: online versus offline optimization and individual versus group optimization. For group optimization, we study two prompting strategies: one-pass and multi-pass prompting optimization. Overall, we demonstrate the effectiveness of our optimization method for role-based multi-agent systems tackling software development tasks evaluated on diverse evaluation dimensions, and we investigate the impact of diverse optimization settings on the group behaviors of the multi-agent systems to provide practical insights for future development.
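Summary
Multi-agent systems based on large language models (LLMs) have made remarkable progress on complex tasks that require collaboration among experts from different fields. However, optimizing LLM-driven multi-agent systems remains challenging. Through an empirical case study, this work examines group optimization of role-based multi-agent systems using natural language feedback on software development tasks, analyzed along several evaluation dimensions. We propose a two-stage agent prompt optimization pipeline: first, textual feedback is used to identify underperforming agents and the reasons for their failures; then the system prompts of the identified agents are optimized according to those failure explanations. With two sets of comparison experiments, online versus offline optimization and individual versus group optimization, we study how different optimization settings affect system performance. For group optimization, we compare two strategies, one-pass prompting and multi-pass prompting. The results show that the method effectively improves role-based multi-agent systems on software development tasks across the different evaluation dimensions. We also investigate how different optimization settings affect the group behavior of the multi-agent system, providing practical insights for future research.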
LLM-Powered AI Agent Systems and Their Applications in Industry
Abstract
arXiv:2505.16120v1. The emergence of Large Language Models (LLMs) has reshaped agent systems. Unlike traditional rule-based agents with limited task scope, LLM-powered agents offer greater flexibility, cross-domain reasoning, and natural language interaction. Moreover, with the integration of multi-modal LLMs, current agent systems are highly capable of processing diverse data modalities, including text, images, audio, and structured tabular data, enabling richer and more adaptive real-world behavior. This paper comprehensively examines the evolution of agent systems from the pre-LLM era to current LLM-powered architectures. We categorize agent systems into software-based, physical, and adaptive hybrid systems, highlighting applications across customer service, software development, manufacturing automation, personalized education, financial trading, and healthcare. We further discuss the primary challenges posed by LLM-powered agents, including high inference latency, output uncertainty, lack of evaluation metrics, and security vulnerabilities, and propose potential solutions to mitigate these concerns.
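Summary
The emergence of large language models (LLMs) has reshaped agent systems. Unlike traditional rule-based agents with limited task scope, LLM-based agents offer greater flexibility, cross-domain reasoning, and natural language interaction. In addition, with the integration of multi-modal LLMs, current agent systems can efficiently process many data modalities, including text, images, audio, and structured tabular data, which enables richer and more adaptive real-world behavior. This paper systematically examines the evolution of agent systems from the pre-LLM era to current LLM-based architectures and divides agent systems into three categories: software-based, physical, and adaptive hybrid. It highlights applications in customer service, software development, manufacturing automation, personalized education, financial trading, and healthcare. We further discuss the main challenges facing LLM-based agents, including high inference latency, output uncertainty, missing evaluation metrics, and security vulnerabilities, and propose potential solutions to mitigate them.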
TrialPanorama: Database and Benchmark for Systematic Review and Design of Clinical Trials
Abstract
arXiv:2505.16097v1. Developing artificial intelligence (AI) for vertical domains requires a solid data foundation for both training and evaluation. In this work, we introduce TrialPanorama, a large-scale, structured database comprising 1,657,476 clinical trial records aggregated from 15 global sources. The database captures key aspects of trial design and execution, including trial setups, interventions, conditions, biomarkers, and outcomes, and links them to standard biomedical ontologies such as DrugBank and MedDRA. This structured and ontology-grounded design enables TrialPanorama to serve as a unified, extensible resource for a wide range of clinical trial tasks, including trial planning, design, and summarization. To demonstrate its utility, we derive a suite of benchmark tasks directly from the TrialPanorama database. The benchmark spans eight tasks across two categories: three for systematic review (study search, study screening, and evidence summarization) and five for trial design (arm design, eligibility criteria, endpoint selection, sample size estimation, and trial completion assessment). Experiments using five state-of-the-art large language models (LLMs) show that while general-purpose LLMs exhibit some zero-shot capability, their performance is still inadequate for high-stakes clinical trial workflows. We release the TrialPanorama database and the benchmark to facilitate further research on AI for clinical trials.
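Summary
Developing artificial intelligence (AI) for vertical domains requires a solid data foundation for training and evaluation. This work introduces TrialPanorama, a large-scale structured database of 1,657,476 clinical trial records aggregated from 15 global data sources. The database captures the key elements of trial design and execution, including trial setups, interventions, conditions, biomarkers, and outcomes, and links them to standard biomedical ontologies such as DrugBank and MedDRA. This ontology-grounded, structured design lets TrialPanorama serve as a unified, extensible resource that supports many clinical trial tasks, including trial planning, design, and summarization. To demonstrate its utility, we derive a suite of benchmark tasks directly from the TrialPanorama database, covering eight tasks in two categories: three systematic review tasks (study search, study screening, and evidence summarization) and five trial design tasks (arm design, eligibility criteria, endpoint selection, sample size estimation, and trial completion assessment). Experiments with five state-of-the-art large language models (LLMs) show that although general-purpose LLMs exhibit some zero-shot capability, their performance still falls short of what high-stakes clinical trial workflows require. We release the TrialPanorama database and the benchmark to promote further research on AI for clinical trials.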
Sudoku-Bench: Evaluating creative reasoning with Sudoku variants
Abstract
arXiv:2505.16135v1. Existing reasoning benchmarks for large language models (LLMs) frequently fail to capture authentic creativity, often rewarding memorization of previously observed patterns. We address this shortcoming with Sudoku-Bench, a curated benchmark of challenging and unconventional Sudoku variants specifically selected to evaluate creative, multi-step logical reasoning. Sudoku variants form an unusually effective domain for reasoning research: each puzzle introduces unique or subtly interacting constraints, making memorization infeasible and requiring solvers to identify novel logical breakthroughs ("break-ins"). Despite their diversity, Sudoku variants maintain a common and compact structure, enabling clear and consistent evaluation. Sudoku-Bench includes a carefully chosen puzzle set, a standardized text-based puzzle representation, and flexible tools compatible with thousands of publicly available puzzles, making it easy to extend into a general research environment. Baseline experiments show that state-of-the-art LLMs solve fewer than 15% of puzzles unaided, highlighting significant opportunities to advance long-horizon, strategic reasoning capabilities.
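Summary
Existing reasoning benchmarks for large language models (LLMs) often fail to capture genuine creativity and usually reward only the memorization of known patterns. To address this shortcoming, we present Sudoku-Bench, a carefully curated benchmark of Sudoku variants designed specifically to evaluate creative, multi-step logical reasoning. Sudoku variants are an unusually effective domain for reasoning research: each puzzle contains unique or subtly interacting constraints, which defeats memorization and requires the solver to find a novel logical breakthrough (a "break-in"). Despite their diversity, Sudoku variants keep a common, compact structure that allows clear and consistent evaluation. Sudoku-Bench includes a carefully selected puzzle set, a standardized text-based puzzle representation, and flexible tools compatible with thousands of publicly available puzzles, so it is easy to extend into a general research environment. Baseline experiments show that state-of-the-art LLMs solve fewer than 15% of the puzzles unaided, which leaves substantial room for advancing long-horizon strategic reasoning.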
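One reason a common, compact structure makes evaluation clear is that a proposed solution can be checked mechanically. The sketch below assumes a hypothetical text representation (81 digits in row-major order) and a hypothetical variant rule passed in as a callback; neither is taken from Sudoku-Bench itself, whose actual format and tools are not described in the abstract.

```python
from typing import Callable, List, Optional

Grid = List[List[int]]


def parse_grid(text: str) -> Grid:
    """Parse 81 digits in row-major order (assumed format); other characters are ignored."""
    digits = [int(ch) for ch in text if ch.isdigit()]
    if len(digits) != 81:
        raise ValueError("expected exactly 81 digits")
    return [digits[i * 9:(i + 1) * 9] for i in range(9)]


def is_valid_solution(
    grid: Grid,
    extra_rule: Optional[Callable[[Grid], bool]] = None,
) -> bool:
    """Check the classic row, column, and box rules plus an optional variant rule."""
    target = set(range(1, 10))
    rows_ok = all(set(row) == target for row in grid)
    cols_ok = all({grid[r][c] for r in range(9)} == target for c in range(9))
    boxes_ok = all(
        {grid[br + dr][bc + dc] for dr in range(3) for dc in range(3)} == target
        for br in (0, 3, 6)
        for bc in (0, 3, 6)
    )
    variant_ok = extra_rule(grid) if extra_rule is not None else True
    return rows_ok and cols_ok and boxes_ok and variant_ok


def distinct_main_diagonal(grid: Grid) -> bool:
    """Hypothetical variant rule: digits on the main diagonal are all different."""
    return len({grid[i][i] for i in range(9)}) == 9


# A completed classic grid built from a standard shifting pattern.
solved_text = "".join(
    str((3 * (r % 3) + r // 3 + c) % 9 + 1) for r in range(9) for c in range(9)
)
solved = parse_grid(solved_text)
classic_ok = is_valid_solution(solved)
variant_ok = is_valid_solution(solved, distinct_main_diagonal)
```

The same grid passes the classic rules but fails the added diagonal rule, which shows how a single extra constraint changes what counts as a solution.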
How Memory Management Impacts LLM Agents: An Empirical Study of Experience-Following Behavior
Abstract
arXiv:2505.16067v1. Memory is a critical component in large language model (LLM)-based agents, enabling them to store and retrieve past executions to improve task performance over time. In this paper, we conduct an empirical study on how memory management choices impact LLM agents' behavior, especially their long-term performance. Specifically, we focus on two fundamental memory operations that are widely used by many agent frameworks, namely addition, which incorporates new experiences into the memory base, and deletion, which selectively removes past experiences, to systematically study their impact on agent behavior. Through our quantitative analysis, we find that LLM agents display an experience-following property: high similarity between a task input and the input in a retrieved memory record often results in highly similar agent outputs. Our analysis further reveals two significant challenges associated with this property: error propagation, where inaccuracies in past experiences compound and degrade future performance, and misaligned experience replay, where outdated or irrelevant experiences negatively influence current tasks. Through controlled experiments, we show that combining selective addition and deletion strategies can help mitigate these negative effects, yielding an average absolute performance gain of 10% compared to naive memory growth. Furthermore, we highlight how memory management choices affect agents' behavior under challenging conditions such as task distribution shifts and constrained memory resources. Our findings offer insights into the behavioral dynamics of LLM agent memory systems and provide practical guidance for designing memory components that support robust, long-term agent performance. We also release our code to facilitate further study.
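Summary
Memory is a key component of agents based on large language models (LLMs): it lets them store and retrieve records of past executions and thereby improve task performance over time. This paper presents an empirical study of how memory management strategies affect the behavior of LLM agents, especially their long-term performance. We focus on two basic memory operations widely used by current agent frameworks, addition (incorporating new experiences into the memory base) and deletion (selectively removing past experiences), and systematically analyze their effect on agent behavior. Quantitative analysis shows that LLM agents exhibit an "experience-following" property: when a task input is highly similar to the input of a retrieved memory record, the agent's output tends to be highly similar as well. The analysis further reveals two major challenges caused by this property: error propagation, where errors in past experiences accumulate and degrade future performance, and misaligned experience replay, where outdated or irrelevant experiences harm current tasks. Through controlled experiments, we find that combining selective addition and deletion strategies effectively mitigates these negative effects, yielding an average absolute performance gain of 10% over naive memory growth. We also show how memory management choices affect agent behavior under challenging conditions such as task distribution shift and limited memory resources. The study sheds light on the behavioral dynamics of LLM agent memory systems and offers practical guidance for designing memory components that support robust long-term performance. We also release our code to facilitate follow-up research.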
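The two memory operations studied here, selective addition and selective deletion, can be sketched as a small memory bank. This is only an illustration under stated assumptions: each experience is assumed to carry an embedding and a quality score from some evaluator, addition is gated by a threshold, and deletion drops the lowest-quality records once a capacity is exceeded. The class, its fields, and both policies are hypothetical and are not the strategies evaluated in the paper.

```python
import math
from dataclasses import dataclass, field
from typing import List, Optional, Sequence


@dataclass
class Record:
    embedding: Sequence[float]  # vector for the task input (assumed given)
    output: str                 # what the agent did for that input
    quality: float              # evaluator score in [0, 1] (assumed available)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


@dataclass
class MemoryBank:
    add_threshold: float = 0.7
    capacity: int = 100
    records: List[Record] = field(default_factory=list)

    def add(self, record: Record) -> bool:
        """Selective addition: store only experiences judged good enough."""
        if record.quality < self.add_threshold:
            return False
        self.records.append(record)
        self.prune()
        return True

    def prune(self) -> None:
        """Selective deletion: over capacity, keep only the best-scored records."""
        if len(self.records) > self.capacity:
            self.records.sort(key=lambda r: r.quality, reverse=True)
            del self.records[self.capacity:]

    def retrieve(self, query: Sequence[float]) -> Optional[Record]:
        """Return the stored record whose input is most similar to the query."""
        if not self.records:
            return None
        return max(self.records, key=lambda r: cosine(query, r.embedding))


bank = MemoryBank(add_threshold=0.7, capacity=2)
kept = bank.add(Record([1.0, 0.0], "plan A", 0.9))
rejected = bank.add(Record([0.0, 1.0], "plan B", 0.3))   # below threshold, not stored
bank.add(Record([0.0, 1.0], "plan C", 0.8))
bank.add(Record([0.7, 0.7], "plan D", 0.95))             # exceeds capacity, triggers pruning
nearest = bank.retrieve([0.9, 0.1])
```

Gating what enters memory limits error propagation at the source, while pruning limits how long a weak or outdated record can keep being retrieved and followed.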